# Multimodal dialogue

## Spatial LLaVA 7B GGUF
Apache-2.0 · rogerxi · 252 downloads · 1 like
Spatial-LLaVA-7B is a multimodal model fine-tuned from LLaVA to improve spatial-relationship reasoning, suited to multimodal research and chatbot development.
Tags: Text-to-Image, Safetensors

## Qwen3 8B NEO Imatrix Max GGUF
Apache-2.0 · DavidAU · 178 downloads · 1 like
A NEO Imatrix quantization of the Qwen3-8B model with a 32K context window and enhanced reasoning ability; a loading sketch follows this entry.
Tags: Large Language Model

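As a minimal illustration of running a GGUF build such as the one above with its full 32K context, the sketch below uses the llama-cpp-python bindings. The file name and prompt are placeholders rather than values taken from the listing, and the quantization file you actually download may be named differently.

```python
# Minimal sketch, assuming llama-cpp-python is installed and the GGUF file has
# been downloaded locally; the file name below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-8B-NEO-Imatrix-Max.Q4_K_M.gguf",  # placeholder file name
    n_ctx=32768,      # request the full 32K context window advertised for this build
    n_gpu_layers=-1,  # offload all layers to the GPU if one is available
)

out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Explain importance-matrix (imatrix) quantization in two sentences."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```
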
## VL Rethinker 72B MLX 4-bit
Apache-2.0 · TheCluster · 14 downloads · 0 likes
A 4-bit quantization of VL-Rethinker-72B, optimized for Apple devices using the MLX framework and supporting visual question answering tasks.
Tags: Text-to-Image, English

## Gemma 3 12B It GPTQ 4b 128g
ISTA-DASLab · 1,175 downloads · 2 likes
An INT4 quantization of google/gemma-3-12b-it produced with the GPTQ algorithm, reducing weights from 16-bit to 4-bit and significantly cutting disk-space and GPU-memory requirements; see the loading sketch after this entry.
Tags: Image-to-Text, Transformers

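A hedged sketch of how a GPTQ-quantized image-text checkpoint like this one is typically loaded through Hugging Face transformers. It assumes a recent transformers release with Gemma 3 support and a GPTQ backend (e.g. gptqmodel) installed; the repository id, image URL, and prompt are illustrative placeholders, not values confirmed by the listing.

```python
# Sketch only: repo id, image URL, and prompt are placeholders.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "ISTA-DASLab/gemma-3-12b-it-GPTQ-4b-128g"  # assumed repository id
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder image
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(generated[0], skip_special_tokens=True))
```

Since the weights are stored in 4-bit form, the weight files take roughly a quarter of the disk space of the 16-bit original, which is the saving the entry refers to.
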
## Qwen2.5 VL 7B Instruct GPTQ Int4
Apache-2.0 · hfl · 872 downloads · 3 likes
Qwen2.5-VL-7B-Instruct-GPTQ-Int4 is an unofficial GPTQ Int4 quantization of the Qwen2.5-VL-7B-Instruct model, supporting multimodal image-text-to-text tasks.
Tags: Image-to-Text, Transformers, Supports Multiple Languages

## Llama3.1 Typhoon2 Audio 8B Instruct
scb10x · 664 downloads · 9 likes
Typhoon 2 Audio Edition is an end-to-end speech-to-speech model architecture that processes audio, speech, and text inputs and can generate text and speech outputs simultaneously. The model is optimized for Thai and also supports English.
Tags: Text-to-Audio, Transformers, Supports Multiple Languages

## ChatRex 7B
IDEA-Research · 825 downloads · 14 likes
ChatRex is a perception-focused multimodal large language model that can tie its answers to specific objects in the image while responding to questions.
Tags: Image-to-Text, English

## GLM Edge V 5B
Other license · THUDM · 4,357 downloads · 12 likes
GLM-Edge-V-5B is a 5-billion-parameter multimodal model that supports image and text inputs and can perform image understanding and text generation tasks.
Tags: Image-to-Text

## GLM Edge V 2B
Other license · THUDM · 23.43k downloads · 11 likes
GLM-Edge-V-2B is an image-text-to-text model built on the PyTorch framework, with support for Chinese.
Tags: Image-to-Text

## MMDuet
MIT · wangyueqian · 69 downloads · 4 likes
MMDuet is a VideoLLM that supports real-time interaction during video playback, focusing on time-sensitive video understanding tasks.
Tags: Video-to-Text, English

## Aria Sequential MLP FP8 Dynamic
Apache-2.0 · leon-se · 94 downloads · 6 likes
An FP8 dynamically quantized model based on Aria-sequential_mlp, suitable for image-text-to-text tasks and requiring approximately 30 GB of VRAM.
Tags: Image-to-Text, Transformers

## Qwen2 VL Tiny Random
yujiepan · 27 downloads · 1 like
A small, randomly initialized debugging model built from the Qwen2-VL-7B-Instruct configuration, useful for smoke-testing vision-language pipelines before the full model is loaded; see the sketch after this entry.
Tags: Image-to-Text, Transformers

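A hedged sketch of the usual workflow for a tiny random checkpoint like this: point the pipeline at the small model first so that template, shape, and generation bugs surface in seconds, then swap in the real 7B weights. The repository ids are illustrative assumptions, and the class names assume a transformers version with Qwen2-VL support.

```python
# Sketch: verify that the processor and model code paths run against the tiny
# random checkpoint before downloading the full 7B model. Repo ids are assumed.
from transformers import AutoConfig, AutoProcessor, Qwen2VLForConditionalGeneration

DEBUG = True  # flip to False once the pipeline works end to end
model_id = "yujiepan/qwen2-vl-tiny-random" if DEBUG else "Qwen/Qwen2-VL-7B-Instruct"

config = AutoConfig.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id)

# The tiny model keeps the Qwen2-VL architecture but with far fewer parameters,
# so loading and a forward pass are cheap; its outputs are meaningless by design.
print(config.model_type, sum(p.numel() for p in model.parameters()), "parameters")
```
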
## InternVideo2 Chat 8B HD
MIT · OpenGVLab · 190 downloads · 16 likes
InternVideo2-Chat-8B-HD is a video understanding model that combines a large language model with VideoBLIP; it is built through a progressive learning scheme and can handle high-definition video input.
Tags: Video-to-Text, Safetensors

## LLaVA LLaMA 2 13B Chat Lightning Preview
liuhaotian · 2,122 downloads · 46 likes
LLaVA is an open-source multimodal chatbot based on the Transformer architecture, obtained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data.
Tags: Text-to-Image, Transformers

## BLIP-2 OPT 2.7B 8-bit
MIT · Mediocreatmybest · 69 downloads · 2 likes
BLIP-2 is a vision-language pre-trained model that combines an image encoder and a large language model for image-to-text generation tasks.
Tags: Image-to-Text, Transformers, English

## BLIP-2 Image To Text
MIT · paragon-AI · 343 downloads · 27 likes
BLIP-2 is a vision-language pre-trained model that bootstraps language-image pre-training from a frozen image encoder and a frozen large language model; a captioning sketch follows this entry.
Tags: Image-to-Text, Transformers, English

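For either of the two BLIP-2 entries above, plain image captioning with the transformers BLIP-2 classes looks roughly like the sketch below. The checkpoint id shown is the reference Salesforce release rather than the listed repositories, and the image URL is a placeholder.

```python
# Sketch: BLIP-2 image captioning via transformers (also needs Pillow and requests).
import requests
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

model_id = "Salesforce/blip2-opt-2.7b"  # reference checkpoint; swap in a listed repo id
processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(model_id, device_map="auto")

url = "https://example.com/photo.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

# With no text prompt, the model produces an unconditional caption for the image.
inputs = processor(images=image, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```
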
## MiniGPT-4 LLaMA 7B
wangrongsheng · 1,777 downloads · 18 likes
MiniGPT-4 is a multimodal model that combines visual and language capabilities, developed on top of the Vicuna language model.
Tags: Text-to-Image, Transformers

## LLaVA 13B v0 4-bit 128g
wojtab · 167 downloads · 79 likes
LLaVA is a multimodal model combining vision and language, based on the LLaMA architecture and supporting image understanding and dialogue generation.
Tags: Text-to-Image, Transformers